I would first like to congratulate Dr. Anderson-Cook and Dr. Lu for an important, insightful, and timely paper. I'm honored and grateful to have been invited to discuss the paper and share my thoughts on this topic, although I must admit that I find it difficult to write a discussion because I agree so strongly with their arguments. In particular, I too emphatically believe that designed data collection remains relevant in the era of Big Data. In what follows I offer my thoughts on opportunities for designed experiments in this modern era, amplifying and augmenting some mentioned in the main paper and remarking on others not addressed.

Drs. Anderson-Cook and Lu advocate for the use of design of experiments (DOE) principles in the execution and analysis of simulation studies. Chipman and Bingham1 have recently provided similar guidance, acknowledging that the simulation studies now ubiquitous in methodological statistics and data science research tend to be factorial experiments, whether the researchers identify them as such or not. Chipman and Bingham1 describe the value of using fractional factorial designs to study the simulation factors more efficiently, and they demonstrate how ANOVA and main and interaction effect plots may be used to provide deeper insights into the results of a simulation. In the interest of developing widely useful methodology, they also encourage the use of robust parameter designs for simulations to study the robustness of methodological developments to uncontrollable factors.

In a similar vein, another contemporary application of experimental design principles is machine learning hyperparameter optimization. The models and algorithms associated with machine learning and deep learning efforts typically rely on several (in some cases dozens of) hyperparameters, and interest lies in identifying the configuration of those parameters that optimizes predictive performance. Most often, hyperparameter tuning is done via grid search (i.e., factorial designs), but this problem is perfectly suited for response surface methodology: screening experiments can efficiently determine which hyperparameters are important, and response surface experiments with second-order models can then determine the optimal configuration of the important ones. Lujan-Moreno et al.,2 Zhang et al.,3 and Pannakkong et al.4 illustrate these ideas in the contexts of random forests, support vector machines, and neural networks, but this is still a burgeoning application area; there is no clear consensus on how hyperparameter values should be tuned, so more work and wider advocacy of DOE methods in this area are warranted.5 A small sketch of the response surface idea is given below.
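To make that two-stage idea concrete, here is a minimal sketch of the second stage: tuning two random forest hyperparameters with a small face-centered central composite design and a fitted second-order model. The dataset, the choice of hyperparameters (max_depth and min_samples_leaf), their ranges, and the use of scikit-learn are illustrative assumptions on my part, not details drawn from the cited studies.

```python
# Minimal sketch: response-surface tuning of two random forest hyperparameters.
# The dataset, hyperparameter choices, and ranges are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Face-centered central composite design in coded units (-1, 0, +1)
# for two hyperparameters: max_depth and min_samples_leaf.
design = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1],   # factorial points
                   [-1, 0], [1, 0], [0, -1], [0, 1],     # axial (face) points
                   [0, 0], [0, 0], [0, 0]])              # center replicates

def decode(coded, low, high):
    """Map a coded level in [-1, 1] to the natural scale [low, high]."""
    return low + (coded + 1) / 2 * (high - low)

responses = []
for d_code, l_code in design:
    depth = int(round(decode(d_code, 2, 20)))       # max_depth in [2, 20]
    leaf = int(round(decode(l_code, 1, 20)))        # min_samples_leaf in [1, 20]
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth,
                                min_samples_leaf=leaf, random_state=1)
    responses.append(cross_val_score(rf, X, y, cv=5).mean())

# Fit a second-order (quadratic) response surface in the coded factors.
quad = PolynomialFeatures(degree=2, include_bias=False)
surface = LinearRegression().fit(quad.fit_transform(design), responses)

# Predict over a fine grid of coded settings and report the apparent optimum.
grid = np.array([[a, b] for a in np.linspace(-1, 1, 21)
                        for b in np.linspace(-1, 1, 21)])
best = grid[np.argmax(surface.predict(quad.transform(grid)))]
print("Predicted best (coded):", best,
      "-> max_depth ~", int(round(decode(best[0], 2, 20))),
      ", min_samples_leaf ~", int(round(decode(best[1], 1, 20))))
```

In practice, a screening experiment over the full set of candidate hyperparameters would typically precede a response surface study like this one, so that the second-order model only needs to be fit in the few dimensions that matter.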
Another DOE application of interest to data scientists that Drs. Anderson-Cook and Lu identified is online controlled experimentation (OCE). Such experiments, colloquially referred to as A/B tests, are typically performed to improve and innovate internet-based products and services, ultimately aiming to maximize revenue. Large technology companies run thousands of these experiments each year, hundreds concurrently per day, engaging millions of users.6 If you used the internet today, you were almost certainly in one of these experiments. See Gupta et al.6 and Bojinov and Gupta7 for practical overviews of the process, benefits, challenges, and culture of online experimentation. While this type of experimentation is often overlooked by academic statisticians and mischaracterized as simple two-group comparisons, as the aforementioned references indicate, practical issues and modern challenges abound, necessitating sophisticated statistical solutions. OCEs provide statisticians with a host of new opportunities for methodological and theoretical development. See Larsen et al.8 for a comprehensive review of the statistical challenges and open research problems in this area.

Depending on the nature of the innovation and the problem being solved, many different experimental designs may be appropriate. The specific version of OCEs described by Drs. Anderson-Cook and Lu concerns adaptive treatment assignment and the “explore vs. exploit” tradeoff that typify multi-armed bandit experiments.9, 10 Popularized by Google, this version of experimentation seeks to explore multiple treatment arms to determine which is optimal, while simultaneously exploiting those that have been performing well throughout the experiment. Such an adaptive approach is notorious for increased type I error rates, but proponents suggest that type I errors are of little relevance in this setting and that, by contrast, type II errors are much more costly.9 This form of experimentation is related to traditional sequential hypothesis testing, though it differs in its regard for type I errors. Sequential analyses and optional stopping methods that do prioritize type I errors are still relevant, however. See Johari et al.11, 12 for an extension of the sequential probability ratio test,13 adapted specifically for the context of OCEs. This methodology is the analysis backbone at Optimizely, one of the largest third-party experimentation platform vendors.14 See also Deng et al.15 for a Bayesian approach to continuously monitoring OCEs using Bayes factors.

It's important to note, however, that multi-armed bandits and other sequential methods are just one facet of the broader field of OCEs. Not all such experiments are concerned with the exploration-exploitation tradeoff. In other settings it may be of interest to estimate long-term treatment effects,16-18 heterogeneous treatment effects,19, 20 or average treatment effects in the presence of network interference.21-23 It's also important to recognize that even though these experiments may engage thousands of users generating thousands of data points (providing what Drs. Anderson-Cook and Lu described as “tall, narrow” data), the issue of power is very relevant; the sentiment that OCEs do not suffer from inadequate sample sizes is misconceived.24, 25 Despite enormous sample sizes (relative to more traditional experimental applications), these experiments tend to seek out very small treatment effects and are typically plagued by very noisy data. Variance-reduction techniques like control variates26 and triggered analyses27, 28 are therefore commonplace; a small sketch of the control-variate idea follows this overview. As this brief overview has hopefully illustrated, there are many interesting statistical problems associated with OCEs and ample opportunity for DOE researchers to contribute.
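To illustrate why variance reduction matters when treatment effects are tiny, the following is a minimal sketch of a control-variate adjustment based on a pre-experiment covariate (the idea commonly referred to as CUPED in this literature). The simulated metric, covariate, and effect size are illustrative assumptions, not values taken from any of the cited work.

```python
# Minimal sketch: variance reduction via a pre-experiment control variate
# (a CUPED-style adjustment). All data here are simulated assumptions.
import numpy as np

rng = np.random.default_rng(2024)
n = 100_000                                  # users per arm
true_effect = 0.02                           # small effect typical of OCEs

# Pre-experiment metric (e.g., prior activity) correlated with the outcome.
x_ctrl = rng.normal(10, 4, n)
x_trt = rng.normal(10, 4, n)
y_ctrl = 0.8 * x_ctrl + rng.normal(0, 3, n)
y_trt = 0.8 * x_trt + rng.normal(0, 3, n) + true_effect

# Naive difference-in-means estimator.
naive = y_trt.mean() - y_ctrl.mean()

# Control-variate adjustment: subtract theta * (x - mean(x)) from each outcome,
# where theta = cov(y, x) / var(x) is estimated from the pooled data.
x_all = np.concatenate([x_ctrl, x_trt])
y_all = np.concatenate([y_ctrl, y_trt])
theta = np.cov(y_all, x_all)[0, 1] / np.var(x_all)
y_ctrl_adj = y_ctrl - theta * (x_ctrl - x_all.mean())
y_trt_adj = y_trt - theta * (x_trt - x_all.mean())
adjusted = y_trt_adj.mean() - y_ctrl_adj.mean()

# Compare the (approximate) standard errors of the two estimators.
se_naive = np.sqrt(y_trt.var() / n + y_ctrl.var() / n)
se_adj = np.sqrt(y_trt_adj.var() / n + y_ctrl_adj.var() / n)
print(f"naive estimate    {naive:.4f} (se ~ {se_naive:.4f})")
print(f"adjusted estimate {adjusted:.4f} (se ~ {se_adj:.4f})")
```

Because the pre-experiment covariate is strongly correlated with the outcome, the adjusted estimator has a noticeably smaller standard error than the naive difference in means, which is precisely what makes very small effects detectable at practical sample sizes.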
With all of these modern DOE applications (the few described here and the many more reviewed by Drs. Anderson-Cook and Lu), it's imperative that experimental design principles remain prominent in statistics and data science curricula so that our students appreciate the value of DOE and possess the knowledge necessary to apply these ideas in the many Big Data endeavours they will undoubtedly encounter in their careers. It's my belief that DOE courses in traditional statistics programs should be modernized and rejuvenated to expose students to these many interesting problems and impactful opportunities. Likewise, given the relevance of DOE to the data scientist's toolbox, it's also my opinion that each of the many data science programs popping up at universities around the world should have a dedicated DOE course.

Dr. Stevens is an Assistant Professor of Statistics in the Department of Statistics and Actuarial Science at the University of Waterloo. His research interests lie at the intersection of data science and industrial statistics; his publications span topics including experimental design and A/B testing, social network modeling and monitoring, survival and reliability analysis, measurement system analysis, and the development of estimation-based alternatives to traditional hypothesis testing.